Using Rules and Task Division to Augment Connectionist Learning

William  L.  Oliver  and  Walter  Schneider 

Learning  Research  and  Development  Center 
University  of  Pittsburgh 

Abstract 

Learning as a function of task complexity was examined in human learning and two
connectionist simulations. An example task involved learning to map basic input/output
digital logic functions for six digital gates (AND, OR, XOR and negated versions) with 2 or
6 inputs. Humans given instruction learned the task in about 300 trials and showed no effect
of the number of inputs. Backpropagation learning in a network with 20 hidden units required
68,000 trials and scaled poorly, requiring 8 times as many trials to learn the 6-input gates
as to learn the 2-input gates. A second simulation combined backpropagation with task
division based upon rules humans use to perform the task. The combined approach improved the
scaling of the problem, learning in 3,100 trials and requiring about 3 times as many trials
to learn the 6-input gates as to learn the 2-input gates. Issues regarding scaling and
augmenting connectionist learning with rule-based instruction are discussed.

Introduction

In this paper we compare human learning of a modestly complex task with connectionist
learning that used the procedure known as "backpropagation" (Rumelhart, Hinton & Williams,
1986). We also consider a model that uses rules to divide the task into subtasks that can be
separately learned with backpropagation. We examine the benefits of providing a connectionist
system with a rule-based instructor that can reconfigure the system via attention to learn
components of the task.

A critical issue for artificial intelligence and human learning involves finding learning
algorithms that scale well. Learning time for an algorithm should not increase so dramatically
with task complexity that it can only be applied to toy problems. Minsky and Papert (1988,
p. 262) comment on the importance of the scale issue, stating: "In the examination of theories
of learning and problem solving, the study of such growths in cost is not merely one more
aspect to be taken into account; it is the only aspect worth considering."

To the psychologist the problem of scale has critical importance because the time a
biological system has to learn is limited. A learning algorithm that does not allow the
organism to learn a task in its lifetime is of limited value.


Current connectionist algorithms may scale too poorly to account for human learning in
many instances. Many tasks may be learned far more quickly by humans than by currently
available connectionist procedures, because human learning can be guided by rules. Below we
describe such a task, which humans required around 300 trials to learn. In contrast, our
fastest learning simulations to date using only backpropagation required 68,000 trials (see
Figure 2 below). More importantly, human learning time did not increase with increases in the
complexity of the task, whereas the learning times for the connectionist procedure increased
significantly.


The study of connectionist learning is partially supported by an implicit assumption that
humans provide an existence proof for simple, powerful learning algorithms that scale well.
This assumption is likely to be false. By simple learning algorithms we mean algorithms that
can map inputs to outputs by altering connection weights on each trial given the input and the
desired output state of the system. This learning occurs without using explicit rules or
focusing the network's attention on specific parts of the problem. Human learning in such
situations is poor and does not scale well. Subjects take many trials to learn simple concepts
involving very few feature dimensions (usually about 4) in psychological studies in which
subjects are discouraged from forming verbal rules (e.g., Medin and Schaffer, 1978). Humans
benefit greatly from focusing attention, instruction, hypothesis generation, and learning by
imitation, none of which is present in traditional connectionist learning models. When
learning a complex problem, such as family hierarchies (Hinton, 1985), a connectionist
procedure must develop internal representations solely from the inputs and outputs that are
specified on each learning trial. There is no mechanism to directly instruct the network about
relationships among features (e.g., that female and daughter are correlated features such that
daughters are always female). The backpropagation procedure can learn simple tasks of this
sort, but learning often requires thousands of trials. We believe that both simple learning
algorithms and rule-based learning will be necessary to account for human learning.

The human learning of chicken sexing (identifying young chicks as males or females)
provides a contrast between learning by input-output mapping and learning by instruction on
rules. Until recently chicken sexers had to learn their task on the basis of feedback from
experts and on-the-job practice. It was claimed to have taken years for people to become
proficient at this task (Biederman & Shiffrar, 1987). Biederman and Shiffrar demonstrated that
college students could perform a variant of the chicken sexing task as well as experts when
provided with a classification rule. Only about a minute was needed to instruct the subjects
on this rule, which focused subjects' attention on particular features and told them how to
respond given the presence of those features. This example suggests that humans can learn
complex relations via reinforced input-output mapping, but this learning method scales poorly
and can be greatly improved by using attentional and instructional operations that are
generally absent in connectionist learning.

We are examining connectionist architectures that include attentional focusing and
instruction-based learning (Schneider & Detweiler, 1987; Schneider & Mumme, 1988; Schneider &
Oliver, 1988). These architectures combine features from connectionist and production-system
models. Rule-based processing allows an attentional mechanism to dynamically reconfigure
connectionist networks so that critical features become salient and a task can be decomposed
into subtasks of smaller scale. Using rules allows rapid initial learning of the components of
the task and the serial execution of each component, as occurs in Anderson's (1983) ACT* or
Laird, Rosenbloom and Newell's (1986) SOAR. Connectionist learning within the architecture
can convert serial processing of the component rules to parallel processing as a consequence of
practice. In addition, the mutual constraint nature of connectionist processing provides a
best-match mapping of inputs to outputs that is less brittle than rule-based matching
processes.

In this paper we examine the benefits of task decomposition by comparing the human
learner to a connectionist learning system with and without task decomposition. We examine
the effect of learning as a function of the complexity of the task. The task involved learning
digital input-output mappings for six digital logic gates (AND, OR, XOR and the negated
forms of the rules) for either 2, 4 or 6 inputs per gate. We have studied this task extensively
in the acquisition of human troubleshooting skill (Carlson, Sullivan & Schneider, 1988a,
1988b). When learning this task, human subjects describe their processing as having three
stages. The first stage is encoding the inputs as all 1's, all 0's, or mixed. The second stage
is mapping the coded input and the gate type to the expected output of a 0 or 1. The third
stage involves applying the negation operator when it is required to reverse the output.
Subjects were instructed on rules for each stage and then required to learn 2-, 4-, or 6-input
gate problems. Connectionist learning without decomposition was examined in a network that
mapped the inputs to the outputs through a single layer of hidden units. Input-output pairs
were presented to the network, and backpropagation learning (Rumelhart et al., 1986) was used
to modify the connection weights. Connectionist learning with decomposition was examined in a
network composed of three modules, one for each stage. Each module had an input layer and an
output layer. During training, each module received input and output information for each
stage and propagated error only within its own stage.

Human  Learning  of  Digital  Logic 

The computational properties of connectionist models have been studied by examining how
they learn boolean functions (e.g., Minsky & Papert, 1988; Rumelhart et al., 1986; Volper &
Hampson, 1986). Interestingly, research on digital troubleshooting has also looked at how
subjects learn boolean logic in the laboratory (Brooke & Duncan, 1983; Carlson et al., 1988a,
1988b). In order to compare a connectionist model's learning with human learning, we
designed an experiment that required subjects to learn several boolean functions and later had
the model learn the same set of functions. We were mainly interested in whether increasing the
complexity of the task by increasing the number of inputs to the functions would make the task
much more difficult to learn.

The subjects in this experiment were University of Pittsburgh undergraduates with no
experience in digital logic. A between-subjects experimental design was used; one group of 8
subjects learned digital logic gates with 2 inputs and another group of 9 subjects learned
gates with 6 inputs. The subjects' task was to learn the rules for the gates to a high level
of accuracy while responding as quickly as possible. Subjects typically reach an asymptotic
accuracy of only about 92% in this task (Carlson et al., 1988b). Their errors are random,
suggesting causes other than rule learning (e.g., attention shifts, speed-accuracy trade-offs)
for the less-than-perfect performance.

The subjects learned six digital logic rules: AND, NAND, OR, NOR, XOR, and XNOR. The
subjects predicted the correct outputs when given different combinations of 0's and 1's as
inputs for the various logic gates. The inputs to the gates were randomly determined with
certain constraints on each trial (see below). The gates and their inputs appeared one at a
time on a CRT screen, and the subjects indicated the correct output (0 or 1) by pressing
labelled keys. A computer controlled the sequencing and presentation of the stimuli and
gathered data on the accuracy and speed of the subjects' responses. Feedback on the
correctness of response was provided after each trial. The subjects were given verbal rules
during the early part of the experiment for each gate, such as the following rule for the AND
gate: "if the inputs are all 1's respond 1; if the inputs are mixed (0's and 1's) respond 0;
and if the inputs are all 0's respond 0." When a help key was pressed, the appropriate rules
for a gate appeared in the upper-left-hand corner of the screen. An introduction to the three
gate types (AND, OR, and XOR) involving 24 trials per gate was followed by 36 practice trials
responding to gates and inputs selected at random. The subjects were then given instructions
on how to carry out negation for the different gates (NAND, NOR, and XNOR) and given 24 trials
of practice on each of these gate types. An additional 36 practice trials followed in which
the negated gates were selected at random and presented to the subjects. In the final part of
the experiment, the subjects responded to 300 gates selected at random from the entire set,
including negated gates. The subjects could rest briefly after blocks of 50 trials, and use of
the help key was not permitted.

In order to vary the complexity of the task, the number of inputs to the gates differed
between groups of subjects. One group of subjects saw gates with 2 inputs and another group
saw gates with 6 inputs. Because increasing the number of inputs dramatically changes the
proportion of 1 and 0 responses for a given gate, a constraint was placed on the sampling of
input combinations for the 6-input condition. For the 6-input gates, the probability of
sampling certain input combinations (e.g., the all 1's case for the AND gate) was increased to
maintain the same proportions of 0 and 1 responses as occurred in the 2-input condition.
Without this constraint on the generation of input combinations, the subjects would be biased
towards always giving the same response for a particular gate; for example, they would be
biased towards responding 0 to every AND gate because the probability of that answer being
correct would be .98.
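The .98 figure can be checked by enumeration: only 1 of the 64 possible 6-input combinations makes an AND gate output 1. The sketch below confirms this and shows one way the sampling constraint might be realized. The function names and the `p_all_ones` parameter are hypothetical; the text states only that the probability of certain combinations was increased, not how the sampling was implemented.

```python
import random
from itertools import product

def and_gate(bits):
    """n-input AND: 1 only when every input is 1."""
    return int(all(bits))

def proportion_of_zero_outputs(n_inputs):
    """Share of input combinations giving 0 when combinations are equally likely."""
    combos = list(product((0, 1), repeat=n_inputs))
    zeros = sum(1 for bits in combos if and_gate(bits) == 0)
    return zeros / len(combos)

def sample_and_inputs(n_inputs, rng, p_all_ones=0.25):
    """Assumed sampler: force the all-1's case to occur as often as with 2 inputs."""
    if rng.random() < p_all_ones:
        return (1,) * n_inputs
    others = [bits for bits in product((0, 1), repeat=n_inputs) if not all(bits)]
    return rng.choice(others)

p2 = proportion_of_zero_outputs(2)  # 3 of 4 combinations give 0
p6 = proportion_of_zero_outputs(6)  # 63 of 64 combinations give 0
```

With uniform sampling, `p6` rounds to .98, matching the text; with the assumed sampler, a 6-input AND gate yields a 1 on roughly a quarter of trials, as in the 2-input condition.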

Human  Learning  Results 

The subjects responded correctly on a high proportion of trials (92%) during the final 300
trials of practice. The mean percentages of correct responses over 50-trial blocks were 89, 90,
94, 95, 94, and 93% for blocks 1 through 6 respectively. Hence, the subjects started this final
part of the experiment with high accuracy and became somewhat more accurate with the
additional practice. An analysis of variance that included the variables for input condition
and 50-trial blocks indicated that there were significant differences in accuracy among the
blocks, F(5,75)=4.60, p<.001. The main effect for input condition was not significant,
F(1,15)<1, nor did input condition interact with blocks, F(5,75)<1. The mean accuracies were
92% for the 2-input condition and 93% for the 6-input condition.

An analysis of the subjects' response times also failed to show differences between the 2-
and 6-input conditions. The subjects responded faster, on average, to the 6-input gates (2.18
seconds) than to the 2-input gates (2.31 seconds), but this difference was not significant,
F(1,15)<1. As one might expect, there was a significant speed-up over blocks,
F(5,75)=14.52, p<.001; the means for the six 50-trial blocks, beginning with block 1, were
2.77, 2.34, 2.19, 2.10, 2.02, and 1.90 seconds. The variables input condition and 50-trial
block did not significantly interact, F(5,75)=1.15, p>.34.

In  summary,  the  initial  216  trials  of  training  brought  the  subjects  to  a  high  level  of 
accuracy.  The  final  test  blocks  showed  that  the  subjects  could  maintain,  and  even  improve, 
this  accuracy  when  they  were  tested  on  the  different  gates  at  random.  There  was  no  indication 
that  the  6-input  gates  were  more  difficult  to  learn  than  the  2-input  gates. 

Connectionist  Learning  Without  Task  Division 

We  also  examined  connectionist  learning  of  the  digital  logic  task  using  the  backpropagation 
learning  procedure.  A  software  package  developed  by  McClelland  and  Rumelhart  (1988)  was 
used  to  model  the  task.  To  find  out  how  changing  the  number  of  inputs  would  affect  learning, 
we  modelled  learning  of  2-,  4-  and  6-input  gates. 

The networks trained with backpropagation were feed-forward networks having either 6, 8
or 10 units in the input layer. Each network had 20 hidden units and a single output unit. The
input layer consisted of 3 units to encode gate type, 1 unit to encode negation, and 2, 4 or 6
units to encode the inputs (0's or 1's) to the gates. Figure 1 illustrates the network's
configuration for learning the gates with 6 inputs. Different codes were used for the AND
(100), OR (010), and XOR (001) gates, and the negation unit was set to 1 to represent the
negated gates (NAND, NOR, and XNOR) and otherwise set to 0. The initial weights for the
network were set to random values that varied uniformly between -0.5 and 0.5. The momentum
parameter was set to 0.9. We tried a number of different learning rate parameters, and the
simulations we report below used the parameters that yielded the fastest learning. These
learning rate parameters were .1, .07, and .02 for the 2-, 4-, and 6-input networks
respectively. The learning rates had to be reduced as the number of input units was increased
to yield reasonably stable learning times.
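The input coding described above can be sketched as follows. This is only an illustration: the gate codes and the negation unit follow the text, but the ordering of units within the vector and the names `GATE_CODES`, `NEGATED`, and `encode_pattern` are assumptions.

```python
# Gate-type codes as given in the text: AND (100), OR (010), XOR (001).
GATE_CODES = {"AND": (1, 0, 0), "OR": (0, 1, 0), "XOR": (0, 0, 1)}

# Each negated gate shares the code of its base gate and sets the negation unit.
NEGATED = {"NAND": "AND", "NOR": "OR", "XNOR": "XOR"}

def encode_pattern(gate, bits):
    """Build an input vector: 3 gate-type units, 1 negation unit, then the gate inputs.

    The unit ordering is an assumption made for illustration.
    """
    negation = 1 if gate in NEGATED else 0
    base = NEGATED.get(gate, gate)
    return GATE_CODES[base] + (negation,) + tuple(bits)

six_input_nand = encode_pattern("NAND", (1, 0, 1, 1, 0, 1))
two_input_xor = encode_pattern("XOR", (0, 1))
```

A 6-input pattern yields 10 input units and a 2-input pattern yields 6, matching the layer sizes reported above.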

Following the usual procedure for backpropagation, the networks were repeatedly
presented with the complete set of patterns to be learned in cycles or "epochs." The networks
were presented with patterns corresponding to all possible feature combinations for the gates
and their inputs. Particular patterns in the 4- and 6-input simulations were repeatedly
presented to the network within epochs to achieve the same proportion of 1 and 0 responses
that subjects had encountered in the experiment described above. The weights were adjusted
after each pattern so that the network learned over epochs to respond to the patterns with the
appropriate 0's and 1's.

Figure 1. The configuration of the network that learned the 6-input gates
without task division.

Figure 2. Trials to criterion for humans, backpropagation alone, and
backpropagation with stages. The bars represent standard deviations.

Each network's accuracy was tested at 10-epoch intervals during learning by presenting the
set of training patterns to the network while learning was turned off. A network's response
was assumed to be a 1 if the activation of the output exceeded .5, and 0 if its activation was
less than .5 (possible activation values varied between 0 and 1). Ten simulations were run for
the different network configurations, each starting with different random weights.
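A minimal sketch of this scoring rule is shown below. The names `network_response` and `accuracy` are hypothetical, and an activation of exactly .5 is scored as 0 here, a case the text does not specify.

```python
def network_response(activation, threshold=0.5):
    """Score an output activation in [0, 1] as a binary response."""
    # An activation exactly at the threshold is treated as 0 (an assumption).
    return 1 if activation > threshold else 0

def accuracy(activations, targets):
    """Proportion of patterns whose thresholded response matches the target."""
    correct = sum(network_response(a) == t for a, t in zip(activations, targets))
    return correct / len(targets)

all_correct = accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 1, 0])
one_error = accuracy([0.9, 0.2, 0.4, 0.4], [1, 0, 1, 0])
```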

Figure 2 shows the number of trials (number of epochs times number of patterns per
epoch) needed for each network to learn to the criterion of 100% accuracy. This criterion was
used because the network's behavior was deterministic; if the network was less than perfect it
would always err on the same patterns. These systematic errors, which are uncharacteristic of
our subjects who performed above 90% accuracy, were taken to mean that the network had not
yet learned the task.

Figure 3. The configuration for the network that used task division to learn the
6-input gates.

As the complexity of the task increased, there was a substantial growth in the number of
trials necessary to train the networks. Note that this growth contrasts dramatically with the
lack of any complexity effect in the human data. This growth apparently resulted from the
exponential increase in the number of patterns to be learned by the network; the number of
patterns to be learned doubled with each additional input. There were 24, 96, and 384 patterns
to be learned in the 2-, 4-, and 6-input conditions respectively. Generalization of learning
among the patterns was insufficient to hold down the learning time.
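The pattern counts follow from six gates combined with every binary input combination, that is, 6 x 2^n distinct patterns for n inputs. A short enumeration (the names are illustrative) reproduces the reported figures:

```python
from itertools import product

GATES = ("AND", "NAND", "OR", "NOR", "XOR", "XNOR")

def count_patterns(n_inputs):
    """Count distinct (gate, input combination) pairs for n-input gates."""
    input_combos = product((0, 1), repeat=n_inputs)
    return sum(1 for _ in product(GATES, input_combos))

counts = {n: count_patterns(n) for n in (2, 4, 6)}
```

These are counts of distinct patterns only; the repeated presentations used to balance 0 and 1 responses within an epoch are not included.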

Connectionist  Learning  with  Task  Division 

Human learning may scale well in our task because of the subjects' abilities to divide the
task into component tasks. These component tasks can be separately focused on during both
instructions and performance of the task. The subjects' prior knowledge allows them to be
instructed on the rules that apply to the component task and would, even in the absence of
explicit instructions, allow them to form hypotheses about which feature combinations might
be important. Such task division and use of prior knowledge are, of course, standard features
in many simulations of cognitive processes, e.g., Anderson's ACT* (1983). Furthermore, the
notion of information processing stages has played a fundamental role in cognitive psychology.
Much research has been designed to identify stages of processing and discover how they
interact (e.g., Sternberg, 1969).

To examine how task division might speed up learning in our task, we used
backpropagation to learn the individual component tasks in a modular network. Figure 3
illustrates how the units that coded the gate inputs, gate type, and negation were used as
inputs to the modules. The figure also shows how the outputs from one module became the inputs
to another module. The model had three modules, each containing a layer of input units, a
layer of 10 hidden units, and a layer of output units. The first module (input map) was
trained to recode 2, 4, or 6 inputs of 0's and 1's into codes representing either "all 0's",
"all 1's", or "mixed." The second module (gate map) was trained to produce the correct
responses (1 or 0) when given the recoded inputs and the codes for the gate types (AND, OR,
XOR). The third module (negation) was trained to negate the output of the second module when
negation was called for.
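The target function that each module was trained to approximate can be written directly as rules. The sketch below is not the trained network, only the three-stage mapping it had to learn. The AND row follows the verbal rule quoted earlier in the paper; the OR and XOR rows are assumptions inferred from the standard 2-input truth tables expressed in the three-category code. All names are hypothetical.

```python
def input_map(bits):
    """Stage 1: recode the raw inputs as one of three categories."""
    if all(b == 1 for b in bits):
        return "all_1s"
    if all(b == 0 for b in bits):
        return "all_0s"
    return "mixed"

# Stage 2: (gate type, input category) -> output. Nine entries for any input count.
GATE_MAP = {
    ("AND", "all_1s"): 1, ("AND", "mixed"): 0, ("AND", "all_0s"): 0,
    ("OR", "all_1s"): 1, ("OR", "mixed"): 1, ("OR", "all_0s"): 0,    # assumed
    ("XOR", "all_1s"): 0, ("XOR", "mixed"): 1, ("XOR", "all_0s"): 0,  # assumed
}

def negation(output, negate):
    """Stage 3: reverse the output when the gate is a negated form."""
    return 1 - output if negate else output

def respond(gate_type, negate, bits):
    """Run the three stages in series, as the instructed subjects describe."""
    return negation(GATE_MAP[(gate_type, input_map(bits))], negate)
```

Only `input_map` depends on the number of inputs; the gate map and negation stages have the same small table at every input size, which is consistent with the input-map module dominating learning time as inputs are added.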

To assess total times for the model to learn the task, learning simulations were run for each
module. Our results on learning times are based on 10 runs for each simulation. Each run was
initialized to use a different set of random weights uniformly distributed between -.5 and +.5.
For all modules, the momentum parameter was .9. The learning rate parameters for the
input-map module were .5, .1, and .05 for the 2-, 4-, and 6-input conditions respectively. The
learning rate parameters were .1 for the gate-map module and .5 for the negation module.
These learning rate parameters were selected to enable rapid learning, but no major effort was
taken to find the best parameters.

Figure 4. Trials to criterion as a function of subtasks and no. of inputs. The
bars represent standard deviations.

Figure 4 shows the mean number of trials needed to learn the component tasks for the
different numbers of gate inputs. Figure 4 shows that recoding the input as 1's, 0's, and mixed
requires substantially more trials as the number of inputs is increased. Assuming that learning
can occur for all three modules during each trial, learning time would depend principally on
the module that took the maximum number of trials to learn. This maximum value is plotted in
Figure 2. It is clear from the figure that learning in this case scales considerably better
than learning with backpropagation alone. It should be pointed out, however, that Figure 4
suggests that many more trials would be needed to learn gates with more than six inputs. If
presented with more inputs, the subjects would probably adopt additional coding processes to
cope with increasing complexity, as is thought to occur when subjects chunk visual stimuli into
familiar configurations (Bartram, 1978).


Discussion 

We have examined human and connectionist learning of a modestly complex problem. The
human subjects learned the task very quickly, reaching 90% accuracy by the second block of
distributed practice. There was no evidence of any problem of scaling in the human learning
data, with both the 2- and 6-input conditions reaching an asymptote of 93% in 358 trials.
Reaction times declined substantially over trials, with the 2- and 6-input functions showing
equivalent learning rates. In an extended study of human learning of digital gates (Carlson et
al., 1988a) subjects took about 500 trials per gate or 3000 total trials to bring their
response times below .8 seconds. When responding in .8 seconds, subjects have apparently
shifted to a strategy of direct associative retrieval of the output of each stage given its
input (see Carlson et al., 1988). To acquire this skill of automatic retrieval in the
digital-logic task, subjects require about 5 hours of practice distributed over several
sessions.

In sharp contrast to human learning, connectionist learning without task decomposition
required about 68,000 trials to learn the 6-input case. Assuming that humans take about 6
seconds per trial, about 110 hours would be needed to perform 68,000 trials. This is far more
than the 5 hours humans actually required. Of even greater concern than this long learning
time is the poor scaling shown in learning. The network required about 6 times as many trials
to learn the 6-input as the 2-input case. The dramatic growth in the number of training trials
suggests such a network could not learn an 8-input problem in the lifetime of a human.
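As a rough check on the time estimate, using the 6 seconds per trial assumed in the text:

```latex
\[
68{,}000 \ \text{trials} \times 6 \ \text{s/trial} = 408{,}000 \ \text{s},
\qquad
\frac{408{,}000 \ \text{s}}{3{,}600 \ \text{s/h}} \approx 113 \ \text{h}.
\]
```

This agrees with the rounded figure of about 110 hours and is more than 20 times the roughly 5 hours of practice reported for human subjects.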

Connectionist learning with task decomposition learned the 6-input case in about 3,200
trials and scaled fairly well, requiring 3 times as many trials as the 2-input case. The total
number of trials compares reasonably well with the human performance, at least if we assume
that the human connectionist processing is not well developed until humans can respond below
1 second. Connectionist learning with decomposition learned the 6-input case 21 times faster
than without decomposition.
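The speed-up factor follows from the rounded trial counts reported in this discussion:

```latex
\[
\frac{68{,}000 \ \text{trials (without decomposition)}}{3{,}200 \ \text{trials (with decomposition)}} \approx 21.
\]
```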

The above results suggest that combining rule-based and connectionist learning may
provide the best of both types of computation. Initial rule-based learning (as in ACT* and
SOAR) can search a problem space and decompose a task into subtasks in reasonable amounts of
time. Processing in this rule-based mode is slow, serial, and effortful, as is the performance
of a human novice during the controlled-processing stage of skill acquisition (Shiffrin &
Schneider, 1977; Schneider & Detweiler, 1987). Practice executing the rules allows
connectionist learning to map the inputs to the outputs of each of the component tasks. The
early rule-based processing decomposes a task so that smaller-scale tasks can be learned with
connectionist procedures. This decomposition must identify the basic stages and the number of
output states for each stage. Once tasks have been divided, connectionist learning need no
longer perform gradient descent search in the power set of all possible connections, but
rather has a more limited problem of mapping a small number of input states of each component
task to a small number of output states for each component task. This use of task
decomposition to make connectionist learning scale reasonably is an approach also advocated by
Minsky (1988) to deal with the combinatoric explosion problem that occurs as task complexity
increases.

Some readers might argue that our example provides an unfair test of connectionist learning
and that our conclusions apply to only a limited set of tasks. We will briefly discuss four
criticisms readers may have. First, the problem chosen was a particularly difficult one for
connectionist learning, since it included three levels of non-linearly separable problems
(inputs, gates, negation). We grant this, but it is a real task that humans have no difficulty
performing if they are instructed. Learning combinatoric gates is still a toy problem and one
that must be solved by any model of human learning. Second, by instructing humans we gave away
the answers. We agree, but standard connectionist learning provides no mechanism for
instruction. Since human learning can improve by many orders of magnitude with instruction, it
is important to explore architectures that can benefit from instruction. Third, different
parameters or new learning algorithms may greatly speed learning in the present task, so that
a connectionist procedure could learn the 6-input condition in a reasonable number of trials.
Perhaps, but the critical issue is whether new solutions will scale well. Task division and
use of rules can always be used to reduce the scaling problem for any connectionist procedure,
and it would be surprising if human learning did not make use of this property when learning
new tasks. Fourth, the present study shows that dividing tasks brings about faster learning,
but there is no demonstration of how to implement the task decomposition in a parsimonious
manner. We are currently working on developing such an architecture.

We are developing a connectionist/control architecture (Schneider & Detweiler, 1987;
Schneider & Mumme, 1988; Schneider & Oliver, 1988) that can implement rule-based learning
and connectionist learning and that can benefit from instruction and task division. The
architecture involves connectionist modules that transmit vector messages to one another. The
control architecture uses an attentional gating mechanism that can modulate the transmission
and reception of vectors among modules. Each module outputs information to the controller,
indicating  the  degree  of  module  activity  and  priority  of  its  message.  Controlled  processing  of 
the  rules  involves  altering  what  messages  are  transmitted  and  compared  in  the  network.  For 
example,  in  digital-gate  learning,  the  rule  would  be  of  the  form  "if  all  the  input  module  vectors 
match the lexical vector module (which contains a 1), then transmit the 'ALL 1s' code to the
output of the input-coding module". Through changes in attentional gating, the network can be
reconfigured to execute a process in as many stages as is required to perform the task.
Intermediate states for each stage are represented not as specific units, but as random vectors.
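
The form of the example rule can be illustrated with a minimal sketch. This is not the authors' implementation: the vector length, the +1/-1 coding, the dot-product match test with its threshold, and all names (`random_vector`, `matches`, `all_ones_rule`) are assumptions introduced only for illustration.

```python
import random

DIM = 32  # hypothetical vector length, not taken from the paper


def random_vector(rng, dim=DIM):
    """Return a random +1/-1 vector used as an arbitrary code."""
    return tuple(rng.choice((-1, 1)) for _ in range(dim))


def matches(a, b, threshold=0.9):
    """Assumed match test: normalized dot product at or above a threshold."""
    similarity = sum(x * y for x, y in zip(a, b)) / len(a)
    return similarity >= threshold


rng = random.Random(0)
LEXICAL_ONE = random_vector(rng)    # lexical module's vector for "1"
LEXICAL_ZERO = random_vector(rng)   # lexical module's vector for "0"
ALL_ONES_CODE = random_vector(rng)  # code transmitted when the rule fires


def all_ones_rule(input_vectors):
    """Gate the 'ALL 1s' code through only if every input matches "1"."""
    if all(matches(v, LEXICAL_ONE) for v in input_vectors):
        return ALL_ONES_CODE  # gate open: the message is transmitted
    return None               # gate closed: nothing is transmitted


six_ones = [LEXICAL_ONE] * 6
one_zero = [LEXICAL_ONE] * 5 + [LEXICAL_ZERO]
print(all_ones_rule(six_ones) == ALL_ONES_CODE)
print(all_ones_rule(one_zero) is None)
```

In this sketch the controller's role is reduced to a single conditional; in the architecture described above, the same decision would be carried out by modulating which modules transmit and receive.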

Learning during the input-coding stage illustrates how rule-based and connectionist
learning interact in the connectionist/control architecture. The instructions to the model indicate that
the input code must be encoded in one of three critical states and all the inputs map to these
critical states. The network generates three random-state vectors and associates them with their
respective rules (e.g., ALL 1s = A; ALL 0s = B; MIXED = C). The random vectors are similar to
the gensym operator in LISP programs. During practice, the rule-based performance correctly
solves the problem by serially executing the rules. On each trial the input and output of each
stage are correctly set via the rule-based processing (Schneider & Mumme, 1988).
Connectionist  learning  alters  the  connection  weights  to  directly  map  the  input  to  the  output 
without  the  use  of  the  rule.  As  opposed  to  doing  a  gradient  descent  search  through  the 
connection  space  for  all  possible  output  codes,  the  network  needs  only  to  learn  how  to  map  the 
input  states  to  the  instructed  output  states. 
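
A minimal sketch of how the rules could supply training targets for the input-coding stage follows. It is an illustration under stated assumptions, not the model itself: the binary vector coding, the vector length, and the names `code_inputs`, `make_state_vectors`, and `training_pairs` are all hypothetical. The connectionist learning step that would consume these pairs is omitted.

```python
import random
from itertools import product

LABELS = ("ALL_1S", "ALL_0S", "MIXED")  # the three critical states


def code_inputs(bits):
    """Rule-based input coding: classify a tuple of 0/1 gate inputs."""
    if all(b == 1 for b in bits):
        return "ALL_1S"
    if all(b == 0 for b in bits):
        return "ALL_0S"
    return "MIXED"


def make_state_vectors(labels=LABELS, dim=8, seed=0):
    """Assign each state a distinct random vector (gensym-like codes)."""
    rng = random.Random(seed)
    vectors = {}
    for label in labels:
        vec = tuple(rng.choice((0, 1)) for _ in range(dim))
        while vec in vectors.values():  # redraw on the rare collision
            vec = tuple(rng.choice((0, 1)) for _ in range(dim))
        vectors[label] = vec
    return vectors


def training_pairs(n_inputs, state_vectors):
    """Pair every input pattern with the state vector the rules assign."""
    return [(bits, state_vectors[code_inputs(bits)])
            for bits in product((0, 1), repeat=n_inputs)]


states = make_state_vectors()
pairs = training_pairs(6, states)
print(len(pairs), len({target for _, target in pairs}))
```

However many inputs there are, the targets fall into only three instructed states, which is the sense in which the network does not have to search over all possible output codes.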

As the connectionist/control architecture learns a task, processing shifts from sequential,
rule-based  to  association-based  processing.  Each  module  associatively  maps  its  input  to  the 
output  and  this  process  cascades  over  a  number  of  stages.  This  connectionist  processing  has 
two  important  advantages  over  rule-based  processing.  First,  it  is  faster,  because  information  is 
retrieved  associatively.  Second,  it  is  not  as  brittle  as  rule-based  processing  because  the  mutual 
constraint match property of connectionist mapping will map the input to its closest matching
output. This may provide better generalization when the rule knowledge is ambiguous. The
model follows the changes in human skilled performance as practice continues (Schneider &
Detweiler, 1987; Schneider & Mumme, 1988).
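
The contrast between brittle rule matching and closest-match association can be sketched as follows. This is a simplified stand-in, not the model's mechanism: the stored patterns, the dot-product overlap measure, and the names `exact_rule` and `closest_output` are assumptions chosen for illustration.

```python
# Hypothetical stored associations: (input pattern, output label).
ASSOCIATIONS = [
    ((1, 1, 1, 1, -1, -1, -1, -1), "A"),
    ((1, -1, 1, -1, 1, -1, 1, -1), "B"),
]


def exact_rule(input_vec, associations):
    """Brittle lookup: succeeds only when the input matches exactly."""
    for stored, output in associations:
        if stored == input_vec:
            return output
    return None


def overlap(a, b):
    """Dot product of two +1/-1 vectors; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def closest_output(input_vec, associations):
    """Associative lookup: return the output of the best-overlapping input."""
    best = max(associations, key=lambda pair: overlap(input_vec, pair[0]))
    return best[1]


# The first stored pattern with its fourth element flipped.
noisy = (1, 1, 1, -1, -1, -1, -1, -1)
print(exact_rule(noisy, ASSOCIATIONS))      # no exact match is found
print(closest_output(noisy, ASSOCIATIONS))  # the nearest stored pattern wins
```

Here the degraded input has an overlap of 6 with the first stored pattern and 2 with the second, so the associative lookup still retrieves "A" while the exact-match rule fails.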

Summary 

We  have  provided  an  illustration  of  the  scaling  problem  exhibited  by  backpropagation  when 
required to solve a modestly complex task. We have shown that humans, if they are given
instruction on the digital-logic task, show no effect of scale when the number of inputs to be
learned is increased. The humans learned the most complex task 220 times faster (in terms
of trials) than the connectionist simulation. We also evaluated a model using a task
decomposition exhibited by the human subjects. Connectionist learning of the decomposed tasks scaled
reasonably in this model, learning 21 times faster than the model without task decomposition
for the 6-input case. We speculated that hybrid architectures provide a better processing
environment than either purely rule-based or connectionist processing environments. The hybrid
approach  appears  to  scale  well  and  to  learn  at  rates  comparable  to  humans. 